Problem 6.1

Attribution: Elena and Shashank

First we import the necessary modules and read in the data.

In [2]:
import itertools
import warnings
import time

# Our numerical workhorses
import numpy as np
from numpy import trapz
import pandas as pd
import scipy.stats as st
import scipy.special

# The MCMC Hammer
import emcee

# BE/Bi 103 utilities
import bebi103

# Import plotting tools
import matplotlib.pyplot as plt
import seaborn as sns
import corner

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables high res graphics inline (only use with static plots (non-Bokeh))
# SVG is preferred, but there is a bug in Jupyter with vertical lines
%config InlineBackend.figure_formats = {'png', 'retina'}

# JB's favorite Seaborn settings for notebooks
rc = {'lines.linewidth': 2, 
      'axes.labelsize': 18, 
      'axes.titlesize': 18, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
sns.set_style('darkgrid', rc=rc)

# Suppress future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
In [3]:
df = pd.read_csv("./data/gardner_hw6/gardner_mt_catastrophe_only_tubulin.csv", comment="#")

Part A

The probability distribution for catastrophe times for a three-step process ($m=3$) has been defined as:

\begin{align} P(t\mid \tau_1, \tau_2, \tau_3, 3, I) = \frac{1}{\tau_1\tau_2\tau_3}\int_0^t\mathrm{d}t_1 \int_{t_1}^t\mathrm{d}t_2\, \mathrm{e}^{-t_1/\tau_1}\,\mathrm{e}^{-(t_2-t_1)/\tau_2}\,\mathrm{e}^{-(t-t_2)/\tau_3}. \end{align}

This is because we are working under the assumption that each subprocess that contributes towards a catastrophe event is a Poisson process, with each subprocess occurring independently of the others at a defined rate $1/\tau_j$. For a three-step process there are two intermediate completion times, $t_1$ and $t_2$, which is why we integrate over $t_2$ from $t_1$ to $t$ and then over $t_1$ from 0 to $t$.

We can simplify the above expression by pairing $\tau_1$, $\tau_2$, and $\tau_3$ with their respective exponential terms to get:

\begin{align} P(t\mid \tau_1, \tau_2, \tau_3, 3, I) = \int_0^t\mathrm{d}t_1 \int_{t_1}^t \big[ R_1\, \mathrm{e}^{-R_1 \Delta t_1}\, R_2\, \mathrm{e}^{-R_2 \Delta t_2}\, R_3\, \mathrm{e}^{-R_3 \Delta t_3}\big]\, \mathrm{d}t_2, \\ \text{where} && R_1 = \frac{1}{\tau_1}, \,\, R_2 = \frac{1}{\tau_2}, \,\, R_3 = \frac{1}{\tau_3}, \\ \\ \text{and} && \Delta t_1 = t_1, \,\, \Delta t_2 = t_2 - t_1, \,\, \Delta t_3 = t - t_2. \end{align}

If the first two subprocesses complete at times $t_1$ and $t_2$ and the catastrophe occurs at time $t$, then the waiting time for the first subprocess, between the first and second subprocesses, and between the second and third subprocesses would be $t_1$, $t_2 - t_1$, and $t - t_2$, respectively. Each waiting time is exponentially distributed with parameter $\lambda_i$, which in this case is $1/\tau_i$, where $i \in \{1, 2, \dots, m\}$. Hence, the probability distribution for catastrophe times is simply the product of three exponential densities (the waiting times of three independent Poisson processes), with parameters as described above, integrated over the intermediate times.

We solve the inner integral (over $t_2$) before the outer one because its lower limit depends on $t_1$.
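As a numerical sanity check of our own (not part of the original analysis), we can evaluate the nested integral for $m = 3$ directly with quadrature and compare it against the closed-form hypoexponential sum derived in Part B, for arbitrary $\tau$ values:

```python
import numpy as np
from scipy.integrate import dblquad

tau = np.array([2.0, 5.0, 10.0])
t = 7.0

# nested integral: outer over t1 in [0, t], inner over t2 in [t1, t]
def integrand(t2, t1):
    return (np.exp(-t1 / tau[0]) * np.exp(-(t2 - t1) / tau[1])
            * np.exp(-(t - t2) / tau[2])) / np.prod(tau)

numeric, _ = dblquad(integrand, 0, t, lambda t1: t1, lambda t1: t)

# closed-form sum (the m = 3 case of the Part B formula)
analytic = sum(tau[j] * np.exp(-t / tau[j])
               / np.prod(np.delete(tau[j] - tau, j)) for j in range(3))

assert np.isclose(numeric, analytic)
```

The two agree to quadrature precision, which gives us confidence in the algebra below.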

Part B

We generalized the expression for the probability distribution for catastrophe times for $m$ subprocesses and were left with:

\begin{align} P(t\mid \{\tau_j\}, m, I) = \sum_{j=1}^m \frac{\tau_j^{m-2}\,\mathrm{e}^{-t/\tau_j}}{\prod_{k=1,k\ne j}^m (\tau_j - \tau_k)}. \end{align}

Using this expression, the probability distributions for $m=1$ and $m=2$ were found to be:

\begin{align} P(t\mid \tau_1, 1, I) &= \frac{\mathrm{e}^{-t/\tau_1}}{\tau_1} &&\text{and},\\[1em] P(t\mid \tau_1, \tau_2, 2, I) &= \frac{\mathrm{e}^{-t/\tau_2} - \mathrm{e}^{-t/\tau_1}}{\tau_2 - \tau_1} \\[1em] \end{align}
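Before proceeding, here is a simulation check of our own: the total time of a two-step process is the sum of two independent exponential waiting times, so simulated sums should follow the $m = 2$ density above. We compare draws to the CDF obtained by integrating that density:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
tau1, tau2 = 2.0, 5.0

# total catastrophe time for a two-step process: sum of two exponential waits
t_sim = rng.exponential(tau1, 50000) + rng.exponential(tau2, 50000)

# CDF obtained by integrating P(t | tau_1, tau_2, 2, I) from 0 to t
def cdf(t):
    return 1 - (tau2 * np.exp(-t / tau2) - tau1 * np.exp(-t / tau1)) / (tau2 - tau1)

ks = st.kstest(t_sim, cdf)
assert ks.statistic < 0.015   # simulated draws match the analytic CDF
```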

Let's evaluate the probability distribution for $m=2$ for two arbitrary values of $\tau_1$ = 2 and $\tau_2$ = 5, satisfying the condition $\tau_1$ < $\tau_2$.

\begin{align} P(t\mid \tau_1 = 2, \tau_2 = 5, 2, I) &= \frac{\mathrm{e}^{-t/2}}{2 - 5} + \frac{\mathrm{e}^{-t/5}}{5 - 2} = \frac{\mathrm{e}^{-t/5} - \mathrm{e}^{-t/2}}{3} &&\text{for all t}\in D\\[1em] \end{align}

Now let's switch the values for $\tau_1$ and $\tau_2$ such that $\tau_1$ = 5 and $\tau_2$ = 2, satisfying the condition $\tau_1$ > $\tau_2$.

\begin{align} P(t\mid \tau_1 = 5, \tau_2 = 2, 2, I) &= \frac{\mathrm{e}^{-t/5}}{5 - 2} + \frac{\mathrm{e}^{-t/2}}{2 - 5} = \frac{\mathrm{e}^{-t/5} - \mathrm{e}^{-t/2}}{3} &&\text{for all t}\in D\\[1em] \end{align}

As can be seen above, the probability distribution function is unable to differentiate between the two sets of values for $\tau$. In order to avoid this degeneracy, we have to select one of the two conditions, either $\tau_i < \tau_{i+1}$ or $\tau_i > \tau_{i+1}$, for $i \in \{1, \dots, m-1\}$. However, we should note that even though the order of these timescales ($\tau$) might be important biologically (i.e. one process has to occur in order for the second one to start), the order does not matter from a mathematical point of view: the distribution is symmetric under permutation of the $\tau_j$. Hence, we specify that $\tau_1 < \tau_2 < \tau_3 < \dots < \tau_m$ before doing model selection and parameter estimation.
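The permutation symmetry is easy to confirm numerically (a small check of our own, using the $m = 2$ density from the general formula):

```python
import numpy as np

def p_m2(t, tau1, tau2):
    # m = 2 density from the general formula above
    return (np.exp(-t / tau1) / (tau1 - tau2)
            + np.exp(-t / tau2) / (tau2 - tau1))

t = np.linspace(0.1, 100.0, 500)
# swapping tau_1 and tau_2 leaves the density unchanged
assert np.allclose(p_m2(t, 2.0, 5.0), p_m2(t, 5.0, 2.0))
```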

Part C

Let's take a quick look at the data.

In [268]:
df.head()
Out[268]:
    12 uM  7 uM  9 uM  10 uM  14 uM
0  25.000    35    25     50     60
1  40.000    45    40     60     75
2  40.000    50    40     60     75
3  45.429    50    45     75     85
4  50.000    55    50     75    115

We define the log likelihood for PTMCMC as the following:

\begin{align} P(\left\{t_i\right\}\mid \{\tau_j\}, m, I) = \prod_i P(t_i\mid \{\tau_j\}, m, I),\,\,\text{where} \end{align}\begin{align} P(t_i\mid \{\tau_j\}, m, I) = \sum_{j=1}^m \frac{\tau_j^{m-2}\,\mathrm{e}^{-t_i/\tau_j}}{\prod_{k=1,k\ne j}^m (\tau_j - \tau_k)}. \end{align}

For the prior, we use Jeffreys priors for all $\tau_j$ because $\tau$ is a scale parameter. We also normalize the prior in order to use PTMCMC for parameter estimation. In the prior, we also ensure that $\{\tau_j\}$ is ordered least to greatest as described above (we reject it if it is not) and that all $\tau_j$ are within the boundaries we defined as the minimum and maximum reasonable values. (We chose 1 second as the minimum and 1800 seconds as the maximum since those values encompass the range of observed catastrophe times.)
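For reference, the normalized Jeffreys prior for a single scale parameter on $[\tau_{\mathrm{min}}, \tau_{\mathrm{max}}]$ is $1/(\tau \ln(\tau_{\mathrm{max}}/\tau_{\mathrm{min}}))$; a quick quadrature check (ours, not part of the homework code) confirms it integrates to one:

```python
import numpy as np
from scipy.integrate import quad

tau_min, tau_max = 1.0, 1800.0

# normalized Jeffreys prior for a scale parameter on [tau_min, tau_max]
def jeffreys(tau):
    return 1.0 / (tau * np.log(tau_max / tau_min))

area, _ = quad(jeffreys, tau_min, tau_max)
assert np.isclose(area, 1.0)
```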

We also wrote a function to generate starting values for $\{\tau_j\}$ for any value of $m$. We decided to draw from an exponential distribution with mean = 300 because the mean catastrophe time was ~380, and we figured that $\tau$ should therefore be on the order of ~300-400 at least in the case of $m = 1$.

In [499]:
def log_likelihood(tau, t, m):
    """
    Log likelihood for the m-step model.
    tau: array of length m; t: catastrophe times (Pandas series);
    m: number of subprocesses. Returns the summed log likelihood.
    """
    num_handler = np.empty((len(t), m)) # the numerator of the scaling factor
    exp_handler = np.empty((len(t), m)) # the exponent part of the numerators
    den_handler = np.empty((len(t), m)) # denominator of the scaling factor
    
    # iterate over j for all data points 
    for j in range(m):
        num_handler[:,j] = tau[j]**(m-2)
        exp_handler[:,j] = -t/tau[j]
        # use only the k != j elements when taking the product 
        # in the denominator of the scaling factor 
        den_elem = tau[j] - tau
        den_handler[:,j] = np.prod(den_elem[:j]) * np.prod(den_elem[j+1:])

    scaling_factor = num_handler / den_handler

    # compute the log likelihoods for all data points at once
    # (logsumexp was moved from scipy.misc to scipy.special in newer SciPy)
    log_like = scipy.special.logsumexp(exp_handler, axis=1, b=scaling_factor)            

    # return the sum of the log likelihoods
    return np.sum(log_like)

def log_prior(tau, m, tau_min, tau_max):
    """
    Log prior for model defined above. Takes in m and tau, 
    returns log prior.
    """
    # check that tau is ordered least to greatest;
    # also make sure the gap between each pair of tau values is
    # at least 1e-6 to help avoid precision issues when
    # calculating the log likelihood
    for i in range(m-1):
        if not (tau[i] + 1e-6) < tau[i+1]:
            return -np.inf
    
    # check that tau is not outside the boundaries we define
    if tau[0] < tau_min or tau[-1] > tau_max:
        return -np.inf
    
    prior = np.empty(m)
    # calculate and store the elements of the normalized prior 
    # associated with each tau 
    for i in range(m):
        if m == 1:
            prior[i] = tau[i] * np.log(tau_max / tau_min)
        
        elif m > 1 and i == 0:
            prior[i] = tau[i] * np.log(tau[i+1] / tau_min)
        
        elif m > 1 and i == m-1:
            prior[i] = tau[i] * np.log(tau_max / tau[i-1])
        
        else:
            prior[i] = tau[i] * np.log(tau[i+1] / tau[i-1])
            
    return -np.log(np.prod(prior))
        
def tau_start(m, n_walkers, n_temps):
    """
    Generates starting points for each tau.
    """
    p = np.empty((n_temps, n_walkers, m))
    
    for i in range(m):
        p[:,:,i] = np.random.exponential(300, (n_temps,n_walkers))
    return p
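As a standalone sanity check (with the likelihood re-implemented minimally here so the snippet is self-contained), for $m = 1$ the log likelihood should reduce to a sum of exponential log-PDFs:

```python
import numpy as np
import scipy.stats as st
from scipy.special import logsumexp

def log_like_minimal(tau, t, m):
    # minimal re-implementation of the model log likelihood
    tau = np.asarray(tau, dtype=float)
    t = np.asarray(t, dtype=float)
    # scaling factors; the empty product for m = 1 evaluates to 1.0
    weights = np.array([tau[j]**(m - 2) / np.prod(np.delete(tau[j] - tau, j))
                        for j in range(m)])
    return np.sum(logsumexp(-t[:, None] / tau, axis=1, b=weights))

t = np.array([10.0, 50.0, 200.0, 700.0])
# m = 1: the model is a plain exponential distribution
assert np.isclose(log_like_minimal([300.0], t, 1),
                  np.sum(st.expon.logpdf(t, scale=300.0)))
```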

We wrote a function to help us run PTMCMC.

In [485]:
def sample_ptmcmc(data, m, model, tau_min=1, tau_max=1800, n_temps=20, n_walkers=100, n_burn=100, 
                  n_steps=500, threads=None):
    """
    Sample posterior using PTMCMC.
    """
    # arguments for likelihood
    loglargs = (data, m)
    
    # arguments for prior
    logpargs = (m, tau_min, tau_max)
    
    # starting points for the parameters
    p0 = tau_start(m, n_walkers, n_temps)
    
    # column headings for outputted DataFrames
    columns = {'m = 1': ['tau_1'],
               'm = 2': ['tau_1', 'tau_2'],
               'm = 3': ['tau_1', 'tau_2', 'tau_3'],
               'm = 4': ['tau_1', 'tau_2', 'tau_3', 'tau_4'],
               'm = 5': ['tau_1', 'tau_2', 'tau_3', 'tau_4', 'tau_5']}
    
    return bebi103.run_pt_emcee(
            log_likelihood, log_prior, n_burn, n_steps,
            n_temps=n_temps, p0=p0, loglargs=loglargs, logpargs=logpargs, 
            threads=threads, columns=columns[model], return_lnZ=True)

We ran PTMCMC for five different models, $m = 1$ through $m = 5$.

In [6]:
start = time.time()
df_1, lnZ_1, dlnZ_1 = sample_ptmcmc(df['12 uM'], 1, 'm = 1', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)
print('The first PTMCMC took ', time.time()-start, ' seconds.')

start = time.time()
df_2, lnZ_2, dlnZ_2 = sample_ptmcmc(df['12 uM'], 2, 'm = 2', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)
print('The second PTMCMC took ', time.time()-start, ' seconds.')

start = time.time()
df_3, lnZ_3, dlnZ_3 = sample_ptmcmc(df['12 uM'], 3, 'm = 3', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)
print('The third PTMCMC took ', time.time()-start, ' seconds.')

start = time.time()
df_4, lnZ_4, dlnZ_4 = sample_ptmcmc(df['12 uM'], 4, 'm = 4', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)
print('The fourth PTMCMC took ', time.time()-start, ' seconds.')

start = time.time()
df_5, lnZ_5, dlnZ_5 = sample_ptmcmc(df['12 uM'], 5, 'm = 5', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)
print('The fifth PTMCMC took ', time.time()-start, ' seconds.')
The first PTMCMC took  1116.4163601398468  seconds.
The second PTMCMC took  1351.2568359375  seconds.
The third PTMCMC took  1094.7742569446564  seconds.
The fourth PTMCMC took  1206.7325630187988  seconds.
The fifth PTMCMC took  843.8138411045074  seconds.

Let's take a look at the results using corner plots.

In [382]:
_ = corner.corner(df_1[df_1.beta_ind==0][['tau_1']], bins=50)
In [383]:
_ = corner.corner(df_2[df_2.beta_ind==0][['tau_1', 'tau_2']], bins=50)
In [21]:
_ = corner.corner(df_3[df_3.beta_ind==0][['tau_1', 'tau_2', 'tau_3']], bins=50)
In [385]:
_ = corner.corner(df_4[df_4.beta_ind==0][['tau_1', 'tau_2', 'tau_3', 'tau_4']], bins=50)
In [386]:
_ = corner.corner(df_5[df_5.beta_ind==0][['tau_1', 'tau_2', 'tau_3', 'tau_4', 'tau_5']], bins=50)

It is interesting that for $m > 3$, the smallest values of $\tau_j$ start to look exponentially distributed with the means at relatively small values, rather than having unimodal posteriors. Our intuition at this point was that this could mean there is little benefit to adding more $\tau$s after $m = 3$.

So, we next wanted to calculate the odds ratios for our models. However, lnZ for each of our models was returned as -inf. In the course of troubleshooting this issue, one of the first things we did was take a look at the end of the PTMCMC DataFrame for one of our models.

In [22]:
df_3.tail()
Out[22]:
tau_1 tau_2 tau_3 lnlike lnprob beta_ind beta chain
9999995 589.971564 112.345049 71.016021 -inf -inf 19 4.314151e-13 99
9999996 589.971564 112.345049 71.016021 -inf -inf 19 4.314151e-13 99
9999997 589.971564 112.345049 71.016021 -inf -inf 19 4.314151e-13 99
9999998 589.971564 112.345049 71.016021 -inf -inf 19 4.314151e-13 99
9999999 589.971564 112.345049 71.016021 -inf -inf 19 4.314151e-13 99

This was very strange to us: we should not be getting -inf for any of our log likelihoods. Moreover, we found that more than half of our samples had this issue:

In [23]:
print("""Percentage of samples with lnlike = -inf: {0:.2f}%""".\
      format(100 - (len(df_3[df_3.lnlike != -np.inf]) / len(df_3) * 100)))
Percentage of samples with lnlike = -inf: 54.54%

We double- and triple-checked our log likelihood function. While it could return -inf if the weighted sum inside scipy's logsumexp evaluates to zero, that does not appear to be the problem here: if we run our log_likelihood function with the values of tau_1, tau_2, and tau_3 recorded for the last sample (for which lnlike was reported as -inf), we get a finite value:

In [24]:
log_likelihood(df_3.values[-1][:3], df['12 uM'], 3)
Out[24]:
-4896.9814358472122

However, we note that our values of $\tau_j$ are not in proper ranked order for those samples that have lnlike = -inf. (This appeared to be generally true, although above we only show the end of the DataFrame as an example.) This was again surprising because we had thought emcee would not record samples for which the prior evaluated to -inf. Just to double-check our prior, we ran our log_prior function with the same values of tau we tested in the log_likelihood function above, and as expected, we got -inf.

This prompted us to look at the source code for the PT sampler in emcee. We discovered in lines 90-94 that when the log prior returns -inf, emcee records both the log likelihood and the log prior as -inf (it does not actually evaluate the log likelihood function in this case, which indeed it shouldn't). Then, carefully reading lines 252-270 and 318-362, we found that when emcee decides whether to accept a step, it definitely rejects the step if the log prior is -inf; however, while we are probably misinterpreting the code, it seems to us that the log likelihood for that sample would still be recorded as -inf. We suspect there is something subtle wrong with our prior, since it seems very unlikely that emcee has a bug (plus it worked for Justin). However, neither we nor the TAs could figure out what was wrong.

At this point, we began to think that we might need to just drop all of the samples with lnlike = -inf from our PTMCMC DataFrame before calculating lnZ, since those samples should have been rejected anyway. Dropping these samples will not affect our posterior distributions since we only take the distributions from the first temperature, which does not have any lnlike = -inf. We confirmed this by making the corner plot again after dropping all rows with -inf.

In [25]:
_ = corner.corner(df_3[(df_3.beta_ind==0) & (df_3.lnlike != -np.inf)][['tau_1', 'tau_2', 'tau_3']], bins=50)

So then for each model, we took the DataFrame without the problematic samples and calculated lnZ using Justin's handy function that was posted on Piazza.

In [26]:
df1_no_inf = df_1[df_1['lnlike'] != -np.inf]
df2_no_inf = df_2[df_2['lnlike'] != -np.inf]
df3_no_inf = df_3[df_3['lnlike'] != -np.inf]
df4_no_inf = df_4[df_4['lnlike'] != -np.inf]
df5_no_inf = df_5[df_5['lnlike'] != -np.inf]

def lnZ(df_mcmc):
    """
    Compute log Z(1) from PTMCMC traces stored in DataFrame.
    """
    # Average the log likelihood over the samples
    log_mean = np.zeros(len(df_mcmc['beta_ind'].unique()))
    for i, b in enumerate(df_mcmc['beta_ind'].unique()):
        log_mean[i] = df_mcmc['lnlike'][df_mcmc['beta_ind']==b].mean()

    # Set of betas (temperatures)
    betas = np.concatenate((np.array(df_mcmc['beta'].unique()), (0,)))

    # Approximate quadrature
    return -np.dot(log_mean, np.diff(betas))

lnz1 = lnZ(df1_no_inf)
lnz2 = lnZ(df2_no_inf)
lnz3 = lnZ(df3_no_inf)
lnz4 = lnZ(df4_no_inf)
lnz5 = lnZ(df5_no_inf)
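To convince ourselves the quadrature in lnZ is sound, here is a self-contained 1-D toy (entirely our own construction, using grid quadrature in place of MCMC) where the thermodynamic-integration identity $\ln Z = \int_0^1 \langle \ln L \rangle_\beta \, \mathrm{d}\beta$ can be checked against direct integration:

```python
import numpy as np

# 1-D toy: uniform prior on [0, 10], Gaussian-shaped likelihood in tau
tau = np.linspace(0.0, 10.0, 2001)
prior = np.full_like(tau, 0.1)
lnL = -0.5 * ((tau - 4.0) / 0.5)**2

# direct evidence by quadrature
lnZ_direct = np.log(np.trapz(prior * np.exp(lnL), tau))

# thermodynamic integration over a beta ladder (log-spaced, plus beta = 0)
betas = np.concatenate(([0.0], np.logspace(-4, 0, 400)))
means = np.empty_like(betas)
for i, b in enumerate(betas):
    w = prior * np.exp(b * lnL)                       # tempered density
    means[i] = np.trapz(w * lnL, tau) / np.trapz(w, tau)
lnZ_ti = np.trapz(means, betas)

assert abs(lnZ_ti - lnZ_direct) < 0.01
```

The PT sampler replaces the grid averages with MCMC averages over the tempered chains; the quadrature over $\beta$ is the same idea as in the lnZ function above.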

Now we can calculate odds ratios.

In [27]:
print('Odds ratio for m = 1 vs m = 2:', np.exp(lnz1 - lnz2))
print('Odds ratio for m = 2 vs m = 3:', np.exp(lnz2 - lnz3))
print('Odds ratio for m = 3 vs m = 4:', np.exp(lnz3 - lnz4))
Odds ratio for m = 1 vs m = 2: 1.87662293233e-61
Odds ratio for m = 2 vs m = 3: 1.49261650839e-11
Odds ratio for m = 3 vs m = 4: 2.0348352208
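Note that odds ratios computed from log evidences chain multiplicatively, which gives a quick consistency check (with made-up lnZ values here, not our actual results):

```python
import numpy as np

lnz = {1: -5000.0, 2: -4860.1, 3: -4835.2}   # hypothetical lnZ values
o12 = np.exp(lnz[1] - lnz[2])
o23 = np.exp(lnz[2] - lnz[3])
o13 = np.exp(lnz[1] - lnz[3])

# O(1 vs 3) = O(1 vs 2) * O(2 vs 3)
assert np.isclose(o13, o12 * o23)
```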

So far it looks like $m = 3$ gives the best model. Just to double-check, let's calculate the odds ratio for $m = 3$ vs $m = 5$.

In [28]:
print('Odds ratio for m = 3 vs m = 5:', np.exp(lnz3 - lnz5))
Odds ratio for m = 3 vs m = 5: 6237.78448593

So it looks like $m = 3$ does indeed give the best model, although the differences among $m = 3$, $m = 4$, and $m = 5$ are modest compared to the huge gaps separating them from $m = 1$ and $m = 2$. In part (d) we will stick with $m = 3$, since it has the best odds of those three and is the least complex model.

Next we attempted to get the MAP for each $\tau_j$ in the $m = 3$ model by restricting our PTMCMC DataFrame to the first temperature and finding the row with the highest value for lnprob. However, this gave us values which did not at all correspond to the peaks of our histograms for the marginalized posterior distributions of $\tau_1$, $\tau_2$, and $\tau_3$.

In [29]:
# Get the index of the most probable parameter set
df3_map = df_3[(df_3.beta_ind==0) & (df_3.lnlike != -np.inf)]
max_ind = np.argmax(df3_map.lnprob.values)

# Pull out values.
tau1_MAP, tau2_MAP, tau3_MAP = df3_map.values[max_ind, :3]

# Print the results
print("""Most probable parameter values for m = 3:
tau_1:  {0:.3f}
tau_2: {1:.3f}
tau_3: {2:.3f}""".format(tau1_MAP, tau2_MAP, tau3_MAP))
Most probable parameter values for m = 3:
tau_1:  128.332
tau_2: 128.373
tau_3: 128.383

This seemed very strange to us, but even after talking with the TAs, we couldn't figure out why this was happening; again, although we suspect something is wrong with our definition of the prior, we could not find the error (and neither could the TA). So, we decided instead to report the posterior median of each $\tau_j$ as our point estimate, a pragmatic stand-in for the MAP. We calculated these estimates for each of our five models so that we would be able to plot all five on top of the data.
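The median is of course not the MAP, and for a skewed distribution the two can differ substantially; this small check (our own) illustrates the caveat with an exponential, whose mode is 0 but whose median is $\tau \ln 2$:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(100.0, 100000)

# mode of an exponential is 0, but the median is 100 * ln(2) ~ 69.3
med = np.median(samples)
assert abs(med - 100.0 * np.log(2)) < 2.0
```

For our unimodal, roughly symmetric first-temperature posteriors the median should nonetheless land close to the peak.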

In [30]:
df1_map = df_1[(df_1.beta_ind==0) & (df_1.lnlike != -np.inf)]
df2_map = df_2[(df_2.beta_ind==0) & (df_2.lnlike != -np.inf)]
df4_map = df_4[(df_4.beta_ind==0) & (df_4.lnlike != -np.inf)]
df5_map = df_5[(df_5.beta_ind==0) & (df_5.lnlike != -np.inf)]

tau_map1 = np.array([df1_map['tau_1'].median()])
tau_map2 = np.array([df2_map['tau_1'].median(), df2_map['tau_2'].median()])
tau_map3 = np.array([df3_map['tau_1'].median(), df3_map['tau_2'].median(),
                     df3_map['tau_3'].median()])
tau_map4 = np.array([df4_map['tau_1'].median(), df4_map['tau_2'].median(),
                     df4_map['tau_3'].median(), df4_map['tau_4'].median()])
tau_map5 = np.array([df5_map['tau_1'].median(), df5_map['tau_2'].median(),
                     df5_map['tau_3'].median(), df5_map['tau_4'].median(), 
                     df5_map['tau_5'].median()])

# Print the MAPs
print("""Most probable parameter value for m = 1:
tau_1: {0:.3f} seconds
""".format(tau_map1[0]))

# Print the MAPs
print("""Most probable parameter values for m = 2:
tau_1: {0:.3f} seconds
tau_2: {1:.3f} seconds
""".format(tau_map2[0], tau_map2[1]))

# Print the MAPs
print("""Most probable parameter values for m = 3:
tau_1: {0:.3f} seconds
tau_2: {1:.3f} seconds
tau_3: {2:.3f} seconds
""".format(tau_map3[0], tau_map3[1], tau_map3[2]))

# Print the MAPs
print("""Most probable parameter values for m = 4:
tau_1: {0:.3f} seconds
tau_2: {1:.3f} seconds
tau_3: {2:.3f} seconds
tau_4: {3:.3f} seconds
""".format(tau_map4[0], tau_map4[1], tau_map4[2], tau_map4[3]))

# Print the MAPs
print("""Most probable parameter values for m = 5:
tau_1: {0:.3f} seconds
tau_2: {1:.3f} seconds
tau_3: {2:.3f} seconds
tau_4: {3:.3f} seconds
tau_5: {4:.3f} seconds
""".format(tau_map5[0], tau_map5[1], tau_map5[2], tau_map5[3], tau_map5[4]))
Most probable parameter value for m = 1:
tau_1: 380.652 seconds

Most probable parameter values for m = 2:
tau_1: 178.752 seconds
tau_2: 202.140 seconds

Most probable parameter values for m = 3:
tau_1: 93.521 seconds
tau_2: 124.059 seconds
tau_3: 163.106 seconds

Most probable parameter values for m = 4:
tau_1: 5.825 seconds
tau_2: 81.428 seconds
tau_3: 118.971 seconds
tau_4: 172.587 seconds

Most probable parameter values for m = 5:
tau_1: 1.798 seconds
tau_2: 4.553 seconds
tau_3: 79.827 seconds
tau_4: 119.713 seconds
tau_5: 173.011 seconds

At this point we noticed something pretty interesting: the most probable values of $\tau_j$ look like they sum to approximately the same value regardless of the model. In fact, this value turns out to be the mean observed catastrophe time:

In [31]:
print("""MAP of tau for m = 1: {0:.3f} seconds""".format(tau_map1[0]))
print("""Sum of MAPs of taus for m = 2: {0:.3f} seconds""".format(np.sum(tau_map2)))
print("""Sum of MAPs of taus for m = 3: {0:.3f} seconds""".format(np.sum(tau_map3)))
print("""Sum of MAPs of taus for m = 4: {0:.3f} seconds""".format(np.sum(tau_map4)))
print("""Sum of MAPs of taus for m = 5: {0:.3f} seconds""".format(np.sum(tau_map5)))
print("""
Mean catastrophe time for 12 µM tubulin: {0:.3f} seconds""".format(df['12 uM'].mean()))
MAP of tau for m = 1: 380.652 seconds
Sum of MAPs of taus for m = 2: 380.892 seconds
Sum of MAPs of taus for m = 3: 380.686 seconds
Sum of MAPs of taus for m = 4: 378.811 seconds
Sum of MAPs of taus for m = 5: 378.901 seconds

Mean catastrophe time for 12 µM tubulin: 380.554 seconds

We hadn't thought about whether to expect this beforehand, but it makes sense that the total time for all of the subprocesses should add up to the mean catastrophe time, if we assume that each subprocess does not start until the previous one has completed. We think this is essentially what we assumed when setting up our mathematical model, since we defined the waiting time for each subprocess (with timescale $\tau_j$) as the difference between the time of the previous event and the time of the next one.
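This is just linearity of expectation: the total time of sequential exponential subprocesses has mean $\sum_j \tau_j$ regardless of $m$. A quick simulation (using roughly the $m = 3$ point estimates above) confirms it:

```python
import numpy as np

rng = np.random.default_rng(2)
tau = np.array([93.5, 124.1, 163.1])   # approximately the m = 3 estimates

# each subprocess waits an exponential time; the catastrophe time is the sum
t_cat = rng.exponential(tau, size=(200000, 3)).sum(axis=1)
assert abs(t_cat.mean() - tau.sum()) < 2.0
```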

Next we wanted to plot our theoretical models as cumulative distribution functions on top of the data. To do so, we first defined a modified log likelihood function that calculates the likelihoods for individual catastrophe times and stores them in an array, rather than calculating a single likelihood for a set of catastrophe times.

In [32]:
def log_likelihood_t(tau, t, m):
    """
    Like log_likelihood, but returns the log likelihoods for all
    data points in an array instead of their sum.
    """
    num_handler = np.empty((len(t), m)) # the numerator of the scaling factor
    exp_handler = np.empty((len(t), m)) # the exponent part of the numerators
    den_handler = np.empty((len(t), m)) # denominator of the scaling factor
    
    # iterate over j for all data points 
    for j in range(m):
        num_handler[:,j] = tau[j]**(m-2)
        exp_handler[:,j] = -t/tau[j]
        # use only the k != j elements when taking the product 
        # in the denominator of the scaling factor 
        den_elem = tau[j] - tau
        den_handler[:,j] = np.prod(den_elem[:j]) * np.prod(den_elem[j+1:])

    scaling_factor = num_handler / den_handler

    # return the array of log likelihoods, one per data point
    # (logsumexp was moved from scipy.misc to scipy.special in newer SciPy)
    return scipy.special.logsumexp(exp_handler, axis=1, b=scaling_factor)

Then we evaluated the (unnormalized) posterior as the exponentiated sum of the log likelihoods and the log prior; since the prior is a constant for a fixed set of $\tau_j$, it drops out in the normalization step. To convert the resulting curves into CDFs, we normalized each by the area under the curve and calculated the cumulative sums.
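The normalize-then-cumsum recipe can be validated against a distribution with a known CDF (our own check, using a plain exponential with $\tau = 100$ on the same 1800-second grid):

```python
import numpy as np

t = np.arange(1800)
pdf = np.exp(-t / 100.0)               # unnormalized density on the grid
pdf_norm = pdf / np.trapz(pdf, dx=1)
cdf = np.cumsum(pdf_norm)              # grid spacing is 1 s, so cumsum ~ integral

analytic = 1 - np.exp(-t / 100.0)
assert abs(cdf[-1] - 1.0) < 0.02       # the CDF should approach 1
assert np.max(np.abs(cdf - analytic)) < 0.02
```

The small residual error comes from approximating the integral with a unit-spaced Riemann sum, which is negligible at the scale of our plots.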

In [33]:
def post_plot(tau, t, m, tau_min, tau_max):
    return np.exp(log_likelihood_t(tau, t, m) + log_prior(tau, m, tau_min, tau_max))

post1 = post_plot(tau_map1, np.arange(1800), 1, 1, 1800)
post1_area = trapz(post1, dx=1)
post1_norm = post1 / post1_area
post1_cdf = np.cumsum(post1_norm)

post2 = post_plot(tau_map2, np.arange(1800), 2, 1, 1800)
post2_area = trapz(post2, dx=1)
post2_norm = post2 / post2_area
post2_cdf = np.cumsum(post2_norm)

post3 = post_plot(tau_map3, np.arange(1800), 3, 1, 1800)
post3_area = trapz(post3, dx=1)
post3_norm = post3 / post3_area
post3_cdf = np.cumsum(post3_norm)

post4 = post_plot(tau_map4, np.arange(1800), 4, 1, 1800)
post4_area = trapz(post4, dx=1)
post4_norm = post4 / post4_area
post4_cdf = np.cumsum(post4_norm)

post5 = post_plot(tau_map5, np.arange(1800), 5, 1, 1800)
post5_area = trapz(post5, dx=1)
post5_norm = post5 / post5_area
post5_cdf = np.cumsum(post5_norm)

Finally we plotted the empirical CDF (the data) with the theoretical CDFs.

In [34]:
# Build ECDF
y = np.arange(1, len(df['12 uM']) + 1) / len(df['12 uM'])
x = np.sort(df['12 uM'].values)

# Plot ECDF and theoretical CDFs for different values of m
plt.plot(np.arange(1800), post1_cdf, color='blue')
plt.plot(np.arange(1800), post2_cdf, color='green')
plt.plot(np.arange(1800), post3_cdf, color='orange')
plt.plot(np.arange(1800), post4_cdf, color='gray')
plt.plot(np.arange(1800), post5_cdf, color='red')
plt.plot(x, y, '.', color='k')
plt.xlabel('Catastrophe time (s)', fontsize=14)
plt.ylabel('Cumulative probability', fontsize=14)
plt.title('Empirical and theoretical cumulative distributions of catastrophe times',
          fontsize=16)
plt.legend(['Theoretical curve (m = 1)', 'Theoretical curve (m = 2)',
            'Theoretical curve (m = 3)', 'Theoretical curve (m = 4)', 
            'Theoretical curve (m = 5)', 'Data for 12 µM tubulin'], loc=4)
plt.tight_layout()

As expected from the odds ratios, $m = 3$, $m = 4$, and $m = 5$ give the best fits (they are so similar, in fact, that they obscure each other). These three models look pretty close to the empirical CDF, so our model and MAPs seem reasonable despite the trouble we had to go through to get them.

Part D

First let's remind ourselves which concentrations of tubulin were tested.

In [4]:
df.head()
Out[4]:
    12 uM  7 uM  9 uM  10 uM  14 uM
0  25.000    35    25     50     60
1  40.000    45    40     60     75
2  40.000    50    40     60     75
3  45.429    50    45     75     85
4  50.000    55    50     75    115

We wanted to do parameter estimation for the other tubulin concentrations using emcee's Ensemble sampler, since we are only interested in the posterior distributions and don't need to perform model selection again. We defined the functions below in order to run the Ensemble sampler.

In [5]:
def log_posterior(p, t, m=3, tau_min=1, tau_max=1800):
    """
    Calculates the log posterior using our best model.
    """
    tau_1, tau_2, tau_3 = p
    
    # check that tau is ordered least to greatest
    # note the parentheses: without them, `not` would bind to only
    # the first comparison due to operator precedence
    if not ((tau_1 + 1e-6 < tau_2) and (tau_2 + 1e-6 < tau_3)):
        return -np.inf
    
    # check that tau is not outside the boundaries we define
    if tau_1 < tau_min or tau_3 > tau_max:
        return -np.inf
    
    return log_likelihood(p, t, m) + log_prior(p, m, tau_min, tau_max)

def sample_mcmc(data, n_walkers=100, n_burn=100, 
                  n_steps=500, threads=None):
    """
    Sample posterior using MCMC. Written for use
    with our best model.
    """
    # starting points: shape (n_walkers, n_dim) for the Ensemble sampler
    p0 = np.empty((n_walkers, 3))
    for i in range(3):
        p0[:,i] = np.random.exponential(300, n_walkers)

    columns = ['tau_1', 'tau_2', 'tau_3'] 

    args = (data,)

    return bebi103.run_ensemble_emcee(log_posterior, n_burn=n_burn, 
                                      n_steps=n_steps, n_walkers=n_walkers, 
                                      p0=p0, columns=columns, 
                                      args=args, threads=threads)                    
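One subtlety worth flagging when writing ordering checks like the one in log_posterior: `not` binds more loosely than a comparison, but a chained `and` splits the condition, so `not a < b and b < c` parses as `(not (a < b)) and (b < c)`. A minimal illustration (our own, with made-up values):

```python
tau_1, tau_2, tau_3 = 2.0, 9.0, 5.0   # tau_2 > tau_3: should be rejected

unparenthesized = not (tau_1 + 1e-6) < tau_2 and (tau_2 + 1e-6) < tau_3
parenthesized = not ((tau_1 + 1e-6 < tau_2) and (tau_2 + 1e-6 < tau_3))

assert unparenthesized is False   # silently accepts the bad ordering
assert parenthesized is True      # correctly rejects it
```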

Unfortunately, when we attempted to run the Ensemble sampler with these functions, we kept getting NaNs for lnprob, which caused emcee to throw an error. We couldn't figure out the reason for this, but we expect it is tied to the same problem we experienced with PTMCMC in part (c). In the end, we decided to run PTMCMC here as well, since we knew how to circumvent the issue as long as we used the PT sampler.

In [ ]:
df_7uM, lnZ_7uM, dlnZ_7uM = sample_ptmcmc(df['7 uM'].dropna(), 3, 'm = 3', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)

df_9uM, lnZ_9uM, dlnZ_9uM = sample_ptmcmc(df['9 uM'].dropna(), 3, 'm = 3', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)

df_10uM, lnZ_10uM, dlnZ_10uM = sample_ptmcmc(df['10 uM'].dropna(), 3, 'm = 3', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)

df_14uM, lnZ_14uM, dlnZ_14uM = sample_ptmcmc(df['14 uM'].dropna(), 3, 'm = 3', 
                                    n_temps=20, n_walkers=100, 
                                    n_burn=1000, n_steps=5000, threads=3)

As usual, we first look at the results using corner plots.

In [7]:
a = corner.corner(df_7uM[(df_7uM.beta_ind==0)][['tau_1', 'tau_2', 'tau_3']], bins=50)
_ = a.text(0.5, 1, 'PTMCMC results for 7 µM tubulin', fontsize=16)
In [8]:
a = corner.corner(df_9uM[(df_9uM.beta_ind==0)][['tau_1', 'tau_2', 'tau_3']], bins=50)
_ = a.text(0.5, 1, 'PTMCMC results for 9 µM tubulin', fontsize=16)
In [9]:
a = corner.corner(df_10uM[(df_10uM.beta_ind==0)][['tau_1', 'tau_2', 'tau_3']], bins=50)
_ = a.text(0.5, 1, 'PTMCMC results for 10 µM tubulin', fontsize=16)
In [10]:
a = corner.corner(df_14uM[(df_14uM.beta_ind==0)][['tau_1', 'tau_2', 'tau_3']], bins=50)
_ = a.text(0.5, 1, 'PTMCMC results for 14 µM tubulin', fontsize=16)

Although the exact shapes of the posteriors change across different tubulin concentrations, they are all peaked and unimodal.

Next, we calculated the MAPs and plotted our ECDFs and theoretical CDFs for each tubulin concentration, similar to what we did in part (c) to check that PTMCMC gave us reasonable results.

In [15]:
# for 7 µM tubulin

# restrict PTMCMC output to the first temperature and drop all rows
# that have lnlike = -inf
df7uM_map = df_7uM[(df_7uM.beta_ind==0) & (df_7uM.lnlike != -np.inf)]

# generate array of tau point estimates (posterior medians)
tau_map_7uM = np.array([df7uM_map['tau_1'].median(), df7uM_map['tau_2'].median(),
                     df7uM_map['tau_3'].median()])

# calculate theoretical CDF
post7uM = post_plot(tau_map_7uM, np.arange(1800), 3, 1, 1800)
# drop the first value in the array because it is NaN
post7uM = post7uM[1:]
post7uM_area = trapz(post7uM, dx=1)
post7uM_norm = post7uM / post7uM_area
post7uM_cdf = np.cumsum(post7uM_norm)

df7_dropna = df['7 uM'].dropna()

# Build ECDF
y = np.arange(len(df7_dropna)) / len(df7_dropna)
x = np.sort(df7_dropna.values)

# Plot ECDF and theoretical CDF (m = 3); start the x values at 1
# since the first PDF value (t = 0) was dropped
plt.plot(np.arange(1, 1800), post7uM_cdf, color='r')
plt.plot(x, y, '.', color='k')
plt.xlabel('Catastrophe time (s)', fontsize=14)
plt.ylabel('Cumulative probability', fontsize=14)
plt.title('ECDF and CDF for 7 µM tubulin catastrophe times',
          fontsize=16)
plt.legend(['Theoretical curve (m = 3)', 'Data for 7 µM tubulin'], loc=4)
plt.tight_layout()
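A side note on the ECDF convention: `np.arange(n) / n` assigns $y = i/n$ for $i = 0, \ldots, n-1$, so the plotted ECDF tops out at $(n-1)/n$ rather than 1. If that matters, the $(i+1)/n$ convention reaches 1 at the largest observation; a small self-contained sketch (the `ecdf` helper here is ours, not from `bebi103`):

```python
import numpy as np

def ecdf(data):
    """Sorted values and ECDF heights using the (i + 1) / n convention,
    so the ECDF equals 1 at the largest observation."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf(np.array([3.0, 1.0, 2.0]))
# x is [1., 2., 3.]; y is [1/3, 2/3, 1.]
```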
In [16]:
# for 9 µM tubulin

# restrict PTMCMC output to the first temperature and drop all rows
# that have lnlike = -inf
df9uM_map = df_9uM[(df_9uM.beta_ind==0) & (df_9uM.lnlike != -np.inf)]

# generate array of tau point estimates (posterior medians)
tau_map_9uM = np.array([df9uM_map['tau_1'].median(), df9uM_map['tau_2'].median(),
                     df9uM_map['tau_3'].median()])

# calculate theoretical CDF
post9uM = post_plot(tau_map_9uM, np.arange(1800), 3, 1, 1800)
post9uM_area = trapz(post9uM, dx=1)
post9uM_norm = post9uM / post9uM_area
post9uM_cdf = np.cumsum(post9uM_norm)

df9_dropna = df['9 uM'].dropna()

# Build ECDF
y = np.arange(len(df9_dropna)) / len(df9_dropna)
x = np.sort(df9_dropna.values)

# Plot ECDF and theoretical CDF (m = 3)
plt.plot(np.arange(1800), post9uM_cdf, color='r')
plt.plot(x, y, '.', color='k')
plt.xlabel('Catastrophe time (s)', fontsize=14)
plt.ylabel('Cumulative probability', fontsize=14)
plt.title('ECDF and CDF for 9 µM tubulin catastrophe times',
          fontsize=16)
plt.legend(['Theoretical curve (m = 3)', 'Data for 9 µM tubulin'], loc=4)
plt.tight_layout()
In [17]:
# for 10 µM tubulin

# restrict PTMCMC output to the first temperature and drop all rows
# that have lnlike = -inf
df10uM_map = df_10uM[(df_10uM.beta_ind==0) & (df_10uM.lnlike != -np.inf)]

# generate array of tau point estimates (posterior medians)
tau_map_10uM = np.array([df10uM_map['tau_1'].median(), df10uM_map['tau_2'].median(),
                     df10uM_map['tau_3'].median()])

# calculate theoretical CDF
post10uM = post_plot(tau_map_10uM, np.arange(1800), 3, 1, 1800)
post10uM_area = trapz(post10uM, dx=1)
post10uM_norm = post10uM / post10uM_area
post10uM_cdf = np.cumsum(post10uM_norm)

df10_dropna = df['10 uM'].dropna()

# Build ECDF
y = np.arange(len(df10_dropna)) / len(df10_dropna)
x = np.sort(df10_dropna.values)

# Plot ECDF and theoretical CDF (m = 3)
plt.plot(np.arange(1800), post10uM_cdf, color='r')
plt.plot(x, y, '.', color='k')
plt.xlabel('Catastrophe time (s)', fontsize=14)
plt.ylabel('Cumulative probability', fontsize=14)
plt.title('ECDF and CDF for 10 µM tubulin catastrophe times',
          fontsize=16)
plt.legend(['Theoretical curve (m = 3)', 'Data for 10 µM tubulin'], loc=4)
plt.tight_layout()
In [18]:
# for 14 µM tubulin

# restrict PTMCMC output to the first temperature and drop all rows
# that have lnlike = -inf
df14uM_map = df_14uM[(df_14uM.beta_ind==0) & (df_14uM.lnlike != -np.inf)]

# generate array of tau point estimates (posterior medians)
tau_map_14uM = np.array([df14uM_map['tau_1'].median(), df14uM_map['tau_2'].median(),
                     df14uM_map['tau_3'].median()])

# calculate theoretical CDF
post14uM = post_plot(tau_map_14uM, np.arange(1800), 3, 1, 1800)
post14uM_area = trapz(post14uM, dx=1)
post14uM_norm = post14uM / post14uM_area
post14uM_cdf = np.cumsum(post14uM_norm)

df14_dropna = df['14 uM'].dropna()

# Build ECDF
y = np.arange(len(df14_dropna)) / len(df14_dropna)
x = np.sort(df14_dropna.values)

# Plot ECDF and theoretical CDF (m = 3)
plt.plot(np.arange(1800), post14uM_cdf, color='r')
plt.plot(x, y, '.', color='k')
plt.xlabel('Catastrophe time (s)', fontsize=14)
plt.ylabel('Cumulative probability', fontsize=14)
plt.title('ECDF and CDF for 14 µM tubulin catastrophe times',
          fontsize=16)
plt.legend(['Theoretical curve (m = 3)', 'Data for 14 µM tubulin'], loc=4)
plt.tight_layout()
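One more note on the CDF construction: normalizing with `trapz` but accumulating with `np.cumsum` mixes a trapezoidal total with a Riemann cumulative sum, so the plotted curve can end slightly above 1. On the same grid, `scipy.integrate.cumulative_trapezoid` keeps the two consistent; a sketch with a toy exponential PDF (illustrative only, not the catastrophe model):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

# Toy PDF on a one-second grid: exponential with tau = 200 s
t = np.arange(1800)
pdf = np.exp(-t / 200.0)

# Normalize so the PDF integrates (trapezoidally) to 1 on this grid
pdf = pdf / np.trapz(pdf, t)

# Cumulative trapezoid; initial=0 keeps len(cdf) == len(t)
cdf = cumulative_trapezoid(pdf, t, initial=0)
# cdf[-1] is 1 to floating-point precision, and cdf is monotone
```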

As we expected, the theoretical CDFs track the ECDFs closely for all four tubulin concentrations, so the $m = 3$ model looks reasonable throughout.

Next we wanted to plot $\tau_j$ against tubulin concentration.

In [36]:
# put the tau estimates for all tubulin concentrations in a single array
# (tau_map3 is the 12 µM estimate computed in part (c))
tau_map_all = np.array([tau_map_7uM, tau_map_9uM, tau_map_10uM, tau_map3, tau_map_14uM])

# define the array of tubulin concentrations
concentrations = np.array([7, 9, 10, 12, 14])

# also find the sum of tau_j for each concentration and store in an array
tau_map_sum = np.sum(tau_map_all, axis=1)

# plot tau_j vs tubulin concentration
plt.plot(concentrations, tau_map_all[:,0], marker='o')
plt.plot(concentrations, tau_map_all[:,1], marker='o')
plt.plot(concentrations, tau_map_all[:,2], marker='o')
plt.plot(concentrations, tau_map_sum, marker='o')
plt.legend([r'$\tau_1$', r'$\tau_2$', r'$\tau_3$', r'Sum of $\tau_j$'], loc=4)
plt.xlabel('Concentration of tubulin (µM)')
plt.ylabel(r'$\tau_j$ (seconds)')
plt.title(r'$\tau_j$ vs. tubulin concentration')

With the exception of the 7 µM tubulin condition, there seems to be a general trend of increasing values for $\tau_j$ as tubulin concentration increases. This at first seemed counterintuitive given that microtubules polymerize faster when the concentration of tubulin is higher. We naively thought that faster polymerization would mean that the microtubules would more quickly reach some threshold length beyond which catastrophes become more likely.

However, in the course of reading about microtubules, we came across an explanation that accounts for the observed increase in time to catastrophe with tubulin concentration. According to a webpage for a cell biology course at Rensselaer Polytechnic Institute,$^1$ the growing (plus) end of a microtubule is stabilized by a GTP cap. This cap forms because, although β-tubulin hydrolyzes its GTP when it comes into contact with α-tubulin during polymerization, the hydrolysis takes time and is outpaced by polymerization at high tubulin concentrations. The faster the tubulin polymerizes, the larger the stabilizing GTP cap it can build, and thus catastrophe becomes less frequent. Conversely, as the tubulin concentration decreases and polymerization slows, the probability increases that GTP will be hydrolyzed before the next tubulin dimer is added. GTP hydrolysis destabilizes the microtubule, and loss of the GTP cap can initiate a catastrophe.

$^1$ https://www.rpi.edu/dept/bcbp/molbiochem/MBWeb/mb2/part1/microtub.htm
